RapidMiner Studio documentation, version 10.1

Process Documents (Text Processing)

Synopsis

Generates word vectors from a text object.

Description

This operator takes a single TextObject as input and generates a term vector from it. The resulting example set therefore consists of a single example, which makes the operator especially useful for applying a model to a single text. However, because the SingleTextInputOperator also provides a parameter for specifying the text directly, this operator is more appropriate when used from a program, where a TextObject can simply be constructed and passed to the process.
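To illustrate what generating a term vector from a single text means, the following minimal sketch counts how often each entry of a given word list occurs in one text. It is plain Python rather than RapidMiner code: the names `tokenize` and `term_vector`, the lowercase alphabetic tokenization, and the use of raw term occurrences as vector values are all assumptions made for illustration, not the operator's actual implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens.

    Assumption: a deliberately simple tokenizer, chosen for illustration only.
    """
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text, word_list):
    """Return the number of occurrences in `text` of each word in `word_list`.

    The result has one entry per word, in the order of `word_list`, so a
    single text yields a single row (one example).
    """
    counts = Counter(tokenize(text))
    return [counts[word] for word in word_list]


# Hypothetical word list and input text.
word_list = ["data", "mining", "text"]
vector = term_vector("Text mining is data mining applied to text.", word_list)
print(dict(zip(word_list, vector)))  # {'data': 1, 'mining': 2, 'text': 2}
```

Because the word list fixes the order and number of attributes, a vector built this way for a new text lines up with the attributes a model saw during training.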

Input

  • word list

    The word list port.

  • documents (Collection)

    The documents port.

Output

  • example set (Data Table)

    The example set port.

  • word list

    The word list port.

Parameters

  • create_word_vector: If checked, the tokens of a document are used to generate a vector that represents the document numerically.
  • vector_creation: Selects the schema for creating the word vector.
  • add_meta_information: If checked, available meta information about the text, such as filename and date, is added as attributes.
  • keep_text: If checked, the input text is stored as a special String attribute with the role text.
  • prune_method: Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
  • prune_below_percent: Ignore words that appear in less than this percentage of all documents.
  • prune_above_percent: Ignore words that appear in more than this percentage of all documents.
  • prune_below_absolute: Ignore words that appear in fewer than this many documents.
  • prune_above_absolute: Ignore words that appear in more than this many documents.
  • prune_below_rank: Words are ordered by frequency, and words whose frequency is lower than the frequency at the rank given by this percentage are pruned.
  • prune_above_rank: Words are ordered by frequency, and words whose frequency is higher than the frequency at the rank given by this percentage are pruned.
  • datamanagement: Determines how the data is represented internally.
  • parallelize_vector_creation: Determines whether the execution of vector creation should be parallelized.
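The percentage-based and absolute pruning parameters can be pictured as filters on document frequency, that is, the number of documents in which a word appears. The sketch below is plain Python under stated assumptions: the function names are hypothetical, the documents are already tokenized, and words exactly at a threshold are kept, following the "less than" and "more than" wording of the parameter descriptions. It is not the operator's implementation.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for each word, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): count each word once per document
    return df


def prune_percent(tokenized_docs, below_percent, above_percent):
    """Keep words whose document share lies within the given percentages."""
    n = len(tokenized_docs)
    df = document_frequencies(tokenized_docs)
    return sorted(
        word for word, count in df.items()
        if below_percent <= 100.0 * count / n <= above_percent
    )


def prune_absolute(tokenized_docs, below_absolute, above_absolute):
    """Keep words whose document count lies within the given bounds."""
    df = document_frequencies(tokenized_docs)
    return sorted(
        word for word, count in df.items()
        if below_absolute <= count <= above_absolute
    )


# Hypothetical, already tokenized documents.
docs = [
    ["data", "mining", "text"],
    ["data", "text"],
    ["data", "model"],
    ["data", "text", "vector"],
]

# "data" appears in 100% of the documents and is pruned as too frequent;
# "mining", "model" and "vector" appear in 25% and are pruned as too rare.
print(prune_percent(docs, below_percent=30.0, above_percent=90.0))  # ['text']
print(prune_absolute(docs, below_absolute=2, above_absolute=3))     # ['text']
```

Pruning of this kind shrinks the word list before vectors are created, which usually removes both stop-word-like terms and one-off tokens.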